Every four years, the FIFA World Cup is held and overtakes the media for weeks. In fact, it is due to end in just a few days time. As someone who enjoys it casually, just watching for the fun of the game, I was excited by the possible data analyses that could be performed on the data. What can be said about the playing teams? Are there any trends noticable? How can I examine these things?
The World Cup of 2022 has been shocking in more ways than one: dark horses like Japan and South Korea making it far into the tournament, Morocco being the first African team to make it to semi finals, the heated match between the Netherlands and Argentina team (shown in the image above), just to name a few.
As a computer science student, I thought the hype around the world cup would lend itself well to making my report that much more interesting. Follow along with me as I explore the trends of performance in the World Cup over the years, making discoveries you may expect, and some you may not. We will look at how games played out in this tournament, as well as those in previous years, all the way back to 1930.
Considering that the World Cup is not yet done as of the finalization of this report (December 16th), it is exclusive of the final games.
This project consists of a few sections:
In order to perform data analysis on this dataset, I need the help of some python packages. The following will be used throughout the project. They will be used for the following purposes:
pandas: used to manipulate dataframesnumpy: used to calculate statisticsmatplotlib: used to create plotssklearn: used to model regressionstatsmodels: used to perform hypothesis testingpycountry: usedto translate country names to continent names. I import a specific module, pycountry_convert to do this.! pip install pycountry_convertimport pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
! pip install pycountry_convert
import pycountry_convert as pc
from sklearn.linear_model import LinearRegression
Requirement already satisfied: pycountry_convert in /opt/conda/lib/python3.9/site-packages (0.7.2) Requirement already satisfied: repoze.lru>=0.7 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (0.7) Requirement already satisfied: pytest-mock>=1.6.3 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (3.10.0) Requirement already satisfied: wheel>=0.30.0 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (0.37.1) Requirement already satisfied: pprintpp>=0.3.0 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (0.4.0) Requirement already satisfied: pytest-cov>=2.5.1 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (4.0.0) Requirement already satisfied: pytest>=3.4.0 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (7.2.0) Requirement already satisfied: pycountry>=16.11.27.1 in /opt/conda/lib/python3.9/site-packages (from pycountry_convert) (22.3.5) Requirement already satisfied: setuptools in /opt/conda/lib/python3.9/site-packages (from pycountry>=16.11.27.1->pycountry_convert) (59.8.0) Requirement already satisfied: packaging in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (21.3) Requirement already satisfied: attrs>=19.2.0 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (21.4.0) Requirement already satisfied: exceptiongroup>=1.0.0rc8 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.0.4) Requirement already satisfied: tomli>=1.0.0 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.2.2) Requirement already satisfied: iniconfig in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.1.1) Requirement already satisfied: pluggy<2.0,>=0.12 in /opt/conda/lib/python3.9/site-packages (from pytest>=3.4.0->pycountry_convert) (1.0.0) Requirement already satisfied: coverage[toml]>=5.2.1 in /opt/conda/lib/python3.9/site-packages (from pytest-cov>=2.5.1->pycountry_convert) (6.5.0) Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.9/site-packages (from packaging->pytest>=3.4.0->pycountry_convert) (3.0.7)
I will be using the FIFA Football World Cup dataset in this report.
It has a few separate datasets depending on whether you want to view data from the previous tournaments (1930-2018), the current one, or general data about the teams. I will be combining two of the provided datasets: matches_1930_2018.csv and matches_2022.csv.
Though there are many fields in this dataset, I will be narrowing them down to the ones I'm interested in analysing:
home_team, away_team - Home and Away Teams,home_score, away_score - Scores,home_xg, away_xg - XG (Expected Goals): this will be expanded on in a later sectionhome_penalty, away_penalty - Penalties,home_yellow_card_long, away_yellow_card_long - Yellow Cards Given,Date,Score,YearIf you want to learn more about the workings of the Fifa World Cup tournaments to better understand the analyses going forward, check out these links:
Using the read_csv function of pandas, I extract the data from both datasets and into a dataframe. This makes the data analysis process extremely simple. If you are interested in learning more about dataframes in pandas, check out the documentation.
After both dataframes are formed, one for past tournaments and one for the current tournament, I will start cleaning the data.
# Dataset of games from 1930 to 2018
fifa_past = pd.read_csv("matches_1930_2018.csv")
fifa_past
| home_team | away_team | home_score | home_xg | home_penalty | away_score | away_xg | away_penalty | home_manager | home_captain | ... | home_penalty_shootout_miss_long | away_penalty_shootout_miss_long | home_red_card | away_red_card | home_yellow_red_card | away_yellow_red_card | home_yellow_card_long | away_yellow_card_long | home_substitute_in_long | away_substitute_in_long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | France | Croatia | 4 | 1.1 | NaN | 2 | 1.1 | NaN | Didier Deschamps | Hugo Lloris | ... | NaN | NaN | NaN | NaN | NaN | NaN | ["27’|1:0|N'Golo Kanté", '41’|2:... | ['90+2’|4:2|Šime Vrsaljko'] | ["54’|2:1|Steven Nzonzi|for N'Golo Kant... | ['71’|4:2|Andrej Kramarić|for Ante Rebi... |
| 1 | Belgium | England | 2 | 1.5 | NaN | 0 | 1.0 | NaN | Roberto Martínez | Eden Hazard | ... | NaN | NaN | NaN | NaN | NaN | NaN | ['90+3’|2:0|Axel Witsel'] | ['52’|1:0|John Stones', '77’|1:0... | ['39’|1:0|Thomas Vermaelen|for Nacer Ch... | ['46’|1:0|Jesse Lingard|for Danny Rose'... |
| 2 | Croatia | England | 2 | 1.7 | NaN | 1 | 0.6 | NaN | Zlatko Dalić | Luka Modrić | ... | NaN | NaN | NaN | NaN | NaN | NaN | ['48’|0:1|Mario Mandžukić', '96’... | ['54’|0:1|Kyle Walker'] | ['95’|1:1|Josip Pivarić|for Ivan Strini... | ['74’|1:1|Marcus Rashford|for Raheem St... |
| 3 | France | Belgium | 1 | 1.7 | NaN | 0 | 0.4 | NaN | Didier Deschamps | Hugo Lloris | ... | NaN | NaN | NaN | NaN | NaN | NaN | ["87’|1:0|N'Golo Kanté", '90+3’|... | ['63’|1:0|Eden Hazard', '71’|1:0... | ['85’|1:0|Steven Nzonzi|for Olivier Gir... | ['60’|1:0|Dries Mertens|for Mousa Dembé... |
| 4 | Sweden | England | 0 | 0.5 | NaN | 2 | 1.0 | NaN | Janne Andersson | Andreas Granqvist | ... | NaN | NaN | NaN | NaN | NaN | NaN | ['88’|0:2|John Guidetti', '90+4’... | ['87’|0:2|Harry Maguire'] | ['65’|0:2|John Guidetti|for Ola Toivone... | ['77’|0:2|Fabian Delph|for Dele Alli', ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 895 | Argentina | France | 1 | NaN | NaN | 0 | NaN | NaN | Francisco Olazar | Manuel Ferreira | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 896 | Yugoslavia | Brazil | 2 | NaN | NaN | 1 | NaN | NaN | Bosko Simonovic | Milutin Ivković | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 897 | Romania | Peru | 3 | NaN | NaN | 1 | NaN | NaN | Octav Luchide | Emerich Vogl | ... | NaN | NaN | NaN | Plácido Galindo · 70 | NaN | NaN | NaN | NaN | NaN | NaN |
| 898 | United States | Belgium | 3 | NaN | NaN | 0 | NaN | NaN | Bob Millar | Tom Florie | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 899 | France | Mexico | 4 | NaN | NaN | 1 | NaN | NaN | Raoul Caudron | Alexandre Villaplane | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
900 rows × 44 columns
# Dataset of games from the 2022 World Cup so far.
fifa_2022 = pd.read_csv("matches_2022.csv")
fifa_2022
| home_team | away_team | home_score | home_xg | home_penalty | away_score | away_xg | away_penalty | home_manager | home_captain | ... | home_penalty_shootout_miss_long | away_penalty_shootout_miss_long | home_red_card | away_red_card | home_yellow_red_card | away_yellow_red_card | home_yellow_card_long | away_yellow_card_long | home_substitute_in_long | away_substitute_in_long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Qatar | Ecuador | 0.0 | 0.3 | NaN | 2.0 | 1.2 | NaN | Félix Sánchez | Hassan Al-Haydos | ... | NaN | NaN | NaN | NaN | NaN | NaN | ['15’|0:0|Saad Al Sheeb', '22’|0... | ['29’|0:1|Moisés Caicedo', '56’|... | ['71’|0:2|Mohammed Waad|for Hassan Al-H... | ['68’|0:2|Jeremy Sarmiento|for Romario ... |
| 1 | England | IR Iran | 6.0 | 2.1 | NaN | 2.0 | 1.4 | NaN | Gareth Southgate | Harry Kane | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ['25’|0:0|Alireza Jahanbakhsh', '48&rsq... | ['70’|4:1|Eric Dier|for Harry Maguire',... | ['20’|0:0|Hossein Hosseini|for Alireza ... |
| 2 | United States | Wales | 1.0 | 0.8 | NaN | 1.0 | 1.5 | NaN | Gregg Berhalter | Tyler Adams | ... | NaN | NaN | NaN | NaN | NaN | NaN | ['11’|0:0|Sergiño Dest', '13’|0:... | ['40’|1:0|Gareth Bale', '45+2’|1... | ['66’|1:0|Brenden Aaronson|for Weston M... | ['46’|1:0|Kieffer Moore|for Daniel Jame... |
| 3 | Senegal | Netherlands | 0.0 | 0.9 | NaN | 2.0 | 0.7 | NaN | Aliou Cissé | Kalidou Koulibaly | ... | NaN | NaN | NaN | NaN | NaN | NaN | ['90+4’|0:1|Nampalys Mendy', '90+6&rsqu... | ['56’|0:0|Matthijs de Ligt'] | ['62’|0:0|Ismail Jakobs|for Abdou Diall... | ['62’|0:0|Memphis|for Vincent Janssen',... |
| 4 | France | Australia | 4.0 | 4.0 | NaN | 1.0 | 0.5 | NaN | Didier Deschamps | Hugo Lloris | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ['55’|2:1|Mitchell Duke', '80’|4... | ['13’|0:1|Theo Hernández|for Lucas Hern... | ['56’|2:1|Jason Cummings|for Mitchell D... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | Morocco | Portugal | 1.0 | 1.4 | NaN | 0.0 | 0.9 | NaN | Hoalid Regragui | Romain Saïss | ... | NaN | NaN | NaN | NaN | Walid Cheddira · 90+3 | NaN | ['70’|1:0|Achraf Dari', '90+1’|1... | ['87’|1:0|Vitinha'] | ['57’|1:0|Achraf Dari|for Romain Saïss'... | ['51’|1:0|João Cancelo|for Raphaël Guer... |
| 60 | Argentina | Croatia | 3.0 | 2.3 | NaN | 0.0 | 0.5 | NaN | Lionel Scaloni | Lionel Messi | ... | NaN | NaN | NaN | NaN | NaN | NaN | ['68’|2:0|Cristian Romero', '71’... | ['32’|0:0|Mateo Kovačić', '32’|0... | ['62’|2:0|Lisandro Martínez|for Leandro... | ['46’|2:0|Mislav Oršić|for Borna Sosa',... |
| 61 | France | Morocco | 2.0 | 1.9 | NaN | 0.0 | 0.9 | NaN | Didier Deschamps | Hugo Lloris | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ['27’|1:0|Sofiane Boufal'] | ['65’|1:0|Marcus Thuram|for Olivier Gir... | ['21’|1:0|Selim Amallah|for Romain Saïs... |
| 62 | Croatia | Morocco | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | Argentina | France | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
64 rows × 44 columns
Data cleansing is a messy process of "identifying the incorrect, incomplete, inaccurate, irrelevant or missing part of the data and then modifying, replacing or deleting them according to the necessity. Data cleaning is considered a foundational element of the basic data science." It is an integral part of the Data Science Pipeline. To learn more, visit What is Data Cleaning? How to Process Data for Analytics and Machine Learning Modeling?
I cannot yet join the fifa_past and fifa_2022 dataframes, since I intend to perform some analysis on data from this year's tournament only in one case. Thus, the following cleansing will appear duplicated, since I perform it for both datasets.
Given the tension in this year's World Cup, especially between the Netherlands and Argentina, I was immediately interested in preserving this field to analyze in depth later. However, the dataset lists the yellow card attribute appears to be corrupted or in a form that could not be read properly.
There is some sense to it, though. It appears to be in list form, but given as a string. By parsing it and turning it into a list, then taking the length, we get the number of yellow cards given.
# For each past game...
for ind, game in fifa_past.iterrows():
# If the away team received Yellow Cards...
if pd.notnull(game['away_yellow_card_long']):
# Strip the array brackets, and split on the comma character, then take the length of that list
fifa_past.loc[ind, 'away_yellows'] = len(str((game['away_yellow_card_long'])).strip('][').split(', '))
else:
# If there were no yellow cards given in this game, insert 0
fifa_past.loc[ind, 'away_yellows'] = 0
if pd.notnull(game['home_yellow_card_long']): # Same process for the home team
fifa_past.loc[ind, 'home_yellows'] = len(str((game['home_yellow_card_long'])).strip('][').split(', '))
else:
fifa_past.loc[ind, 'home_yellows'] = 0
# Same process as above, but for the 2022 dataset
for ind, game in fifa_2022.iterrows():
if pd.notnull(game['away_yellow_card_long']): # away
fifa_2022.loc[ind, 'away_yellows'] = len(str((game['away_yellow_card_long'])).strip('][').split(', '))
else:
fifa_2022.loc[ind, 'away_yellows'] = 0
if pd.notnull(game['home_yellow_card_long']): # home
fifa_2022.loc[ind, 'home_yellows'] = len(str((game['home_yellow_card_long'])).strip('][').split(', '))
else:
fifa_2022.loc[ind, 'home_yellows'] = 0
The dataset in its current form provides us with a lot of interesting information, but we can step it up a notch. I am interested in finding out what trends may be prevalent when we look at the continent that each team is from. To that end, I use the pycountry_convery module installed in the beginning to convert team/country names to continent names. This must be done for both the home and away team.
However, not all team names are the name of the country from which they live. In those cases, pycountry_convert will throw a KeyError. For the time being, I will catch that error by placing 'TBD' (to be determined) in the field.
# Converts given team/country name to the continent that the team is from
def country_to_continent(country_name):
country_a2 = pc.country_name_to_country_alpha2(country_name) # 2 letter code
country_continent_code = pc.country_alpha2_to_continent_code(country_a2)
country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
return country_continent_name
# For each past game...
for ind, game in fifa_past.iterrows():
# Try using pycountry_convert to provide the continent of that team, and assign it if possible
try:
fifa_past.loc[ind, 'home_continent'] = country_to_continent(game['home_team'])
except KeyError:
# Label teams not named after their country with TBD
fifa_past.loc[ind, 'home_continent'] = 'TBD'
# Repeating the process for the away team
try:
fifa_past.loc[ind, 'away_continent'] = country_to_continent(game['away_team'])
except KeyError:
fifa_past.loc[ind, 'away_continent'] = 'TBD'
# Same process as above for the 2022 dataset
for ind, game in fifa_2022.iterrows():
try:
fifa_2022.loc[ind, 'home_continent'] = country_to_continent(game['home_team'])
except KeyError:
fifa_2022.loc[ind, 'home_continent'] = 'TBD'
try:
fifa_2022.loc[ind, 'away_continent'] = country_to_continent(game['away_team'])
except KeyError:
fifa_2022.loc[ind, 'away_continent'] = 'TBD'
There are quite a few teams that aren't named after their country, or are named slightly differently, or may no longer exist at all! That is to be expected since the World Cup has been around since the 1930s. There are several teams and regions that no longer go by the names they used to.
For each of these teams, I personally assign them to the continent they belong, or belonged, to. I then go back through looking for every TBD field and assign the proper continent to the respective team.
# Mini dataframes for both home and away teams with TBD in the column field
home_tbd = fifa_past[fifa_past['home_continent'] == 'TBD']
away_tbd = fifa_past[fifa_past['away_continent'] == 'TBD']
# This is the list of teams from countries that no longer exist, or go by have different names
tbd = away_tbd['away_team'].unique()
# Using the above variable, I assign each team to the continent they are from
conts = {'Europe': ['England', 'Wales', 'Scotland', 'Germany DR', 'Soviet Union', 'Yugoslavia', 'Czechoslovakia',
'FR Yugoslavia', 'West Germany', 'Republic of Ireland', 'Northern Ireland', 'Serbia and Montenegro'],
'Asia': ['Korea Republic', 'Dutch East Indies', 'IR Iran', 'Korea DPR', 'Türkiye', 'China PR'], 'Africa': ['Zaire']}
# For each game...
for ind, game in fifa_past.iterrows():
# If the continent is not listed for this team
if game['home_continent'] == 'TBD':
# Looking for the continent that has this team's name in it
for c in conts:
if game['home_team'] in conts[c]:
fifa_past.loc[ind, 'home_continent'] = c # Assigning the proper continent to the team
# Same process for the away TBD teams
for ind, game in fifa_past.iterrows():
if game['away_continent'] == 'TBD':
for c in conts:
if game['away_team'] in conts[c]:
fifa_past.loc[ind, 'away_continent'] = c
Below, the process is repeated to assign a continent to the team names that do not yet have one.
# Similar to above, but for the 2022 dataset
for ind, game in fifa_2022.iterrows():
if game['home_continent'] == 'TBD':
for c in conts:
if game['home_team'] in conts[c]:
fifa_2022.loc[ind, 'home_continent'] = c
for ind, game in fifa_2022.iterrows():
if game['away_continent'] == 'TBD':
for c in conts:
if game['away_team'] in conts[c]:
fifa_2022.loc[ind, 'away_continent'] = c
The current dataset does not take into account the winner of any given game. I think it would be interesting to perform analysis on in the future, so we are going to create that column ourselves. The winner of a game will be calculated in one of three ways:
Draws were used in the eliminatories to determine which teams in a group would advance to the round of the final 16. After making it into the final 16, there are no draws, and the stakes of the game increase.
# For each game...
for ind, game in fifa_past.iterrows():
# If there is a tie...
if game['home_score'] == game['away_score']:
# If there were no penalty shootouts, the game was a DRAW
if pd.isnull(game['home_penalty']):
win_cont = 'DRAW'
win = 'DRAW'
# Else, the winning team is the team with the most penalties scored
elif game['home_penalty'] > game['away_penalty']:
win_cont = game['home_continent']
win = game['home_team']
else:
win_cont = game['away_continent']
win = game['away_team']
# If there was no tie, the winning team is the team with the most points scored
else:
if game['home_score'] > game['away_score']:
win_cont = game['home_continent']
win = game['home_team']
else:
win_cont = game['away_continent']
win = game['away_team']
fifa_past.loc[ind, 'winner_cont'] = win_cont # Assigning the name/cont of the winning team
fifa_past.loc[ind, 'winner'] = win
# Repeat for the 2022 dataset
for ind, game in fifa_2022.iterrows():
if game['home_score'] == game['away_score']:
if pd.isnull(game['home_penalty']):
win_cont = 'DRAW'
win = 'DRAW'
elif game['home_penalty'] > game['away_penalty']:
win_cont = game['home_continent']
win = game['home_team']
else:
win_cont = game['away_continent']
win = game['away_team']
else:
if game['home_score'] > game['away_score']:
win = game['home_team']
win_cont = game['home_continent']
else:
win_cont = game['away_continent']
win = game['away_team']
fifa_2022.loc[ind, 'winner_cont'] = win_cont
fifa_2022.loc[ind, 'winner'] = win
The dataset came with 44 fields of data included, but we don't want to keep too many of those. I narrow down the fields to the ones I am most interested in, including the ones I created above. This includes scores, penalties, expected goals, year, and yellow card information.
# Columns I'm interested in
cols = ['home_team', 'away_team', 'home_score', 'home_xg', 'home_penalty', 'away_score', 'away_xg', 'away_penalty',
'Score', 'Year', 'home_yellow_card_long', 'away_yellow_card_long']
# Columns I created
created = ['away_yellows', 'home_yellows', 'home_continent', 'away_continent', 'winner', 'winner_cont']
# Removing the detailed descriptions of yellow cards
yellow_long = ['home_yellow_card_long', 'away_yellow_card_long']
# Selecting only the fields in 'col' as well as the ones I created
new_cols = [attr for attr in (cols + created) if attr not in yellow_long]
fifa_past = fifa_past[new_cols]
fifa_2022 = fifa_2022[new_cols] # Reducing the dimensionality of the dataset.
fifa_past
| home_team | away_team | home_score | home_xg | home_penalty | away_score | away_xg | away_penalty | Score | Year | away_yellows | home_yellows | home_continent | away_continent | winner | winner_cont | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | France | Croatia | 4 | 1.1 | NaN | 2 | 1.1 | NaN | 4–2 | 2018 | 1.0 | 2.0 | Europe | Europe | France | Europe |
| 1 | Belgium | England | 2 | 1.5 | NaN | 0 | 1.0 | NaN | 2–0 | 2018 | 2.0 | 1.0 | Europe | Europe | Belgium | Europe |
| 2 | Croatia | England | 2 | 1.7 | NaN | 1 | 0.6 | NaN | 2–1 | 2018 | 1.0 | 2.0 | Europe | Europe | Croatia | Europe |
| 3 | France | Belgium | 1 | 1.7 | NaN | 0 | 0.4 | NaN | 1–0 | 2018 | 3.0 | 2.0 | Europe | Europe | France | Europe |
| 4 | Sweden | England | 0 | 0.5 | NaN | 2 | 1.0 | NaN | 0–2 | 2018 | 1.0 | 2.0 | Europe | Europe | England | Europe |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 895 | Argentina | France | 1 | NaN | NaN | 0 | NaN | NaN | 1–0 | 1930 | 0.0 | 0.0 | South America | Europe | Argentina | South America |
| 896 | Yugoslavia | Brazil | 2 | NaN | NaN | 1 | NaN | NaN | 2–1 | 1930 | 0.0 | 0.0 | Europe | South America | Yugoslavia | Europe |
| 897 | Romania | Peru | 3 | NaN | NaN | 1 | NaN | NaN | 3–1 | 1930 | 0.0 | 0.0 | Europe | South America | Romania | Europe |
| 898 | United States | Belgium | 3 | NaN | NaN | 0 | NaN | NaN | 3–0 | 1930 | 0.0 | 0.0 | North America | Europe | United States | North America |
| 899 | France | Mexico | 4 | NaN | NaN | 1 | NaN | NaN | 4–1 | 1930 | 0.0 | 0.0 | Europe | North America | France | Europe |
900 rows × 16 columns
fifa_2022
| home_team | away_team | home_score | home_xg | home_penalty | away_score | away_xg | away_penalty | Score | Year | away_yellows | home_yellows | home_continent | away_continent | winner | winner_cont | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Qatar | Ecuador | 0.0 | 0.3 | NaN | 2.0 | 1.2 | NaN | 0–2 | 2022 | 2.0 | 4.0 | Asia | South America | Ecuador | South America |
| 1 | England | IR Iran | 6.0 | 2.1 | NaN | 2.0 | 1.4 | NaN | 6–2 | 2022 | 2.0 | 0.0 | Europe | Asia | England | Europe |
| 2 | United States | Wales | 1.0 | 0.8 | NaN | 1.0 | 1.5 | NaN | 1–1 | 2022 | 2.0 | 4.0 | North America | Europe | DRAW | DRAW |
| 3 | Senegal | Netherlands | 0.0 | 0.9 | NaN | 2.0 | 0.7 | NaN | 0–2 | 2022 | 1.0 | 2.0 | Africa | Europe | Netherlands | Europe |
| 4 | France | Australia | 4.0 | 4.0 | NaN | 1.0 | 0.5 | NaN | 4–1 | 2022 | 3.0 | 0.0 | Europe | Oceania | France | Europe |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | Morocco | Portugal | 1.0 | 1.4 | NaN | 0.0 | 0.9 | NaN | 1–0 | 2022 | 1.0 | 2.0 | Africa | Europe | Morocco | Africa |
| 60 | Argentina | Croatia | 3.0 | 2.3 | NaN | 0.0 | 0.5 | NaN | 3–0 | 2022 | 2.0 | 2.0 | South America | Europe | Argentina | South America |
| 61 | France | Morocco | 2.0 | 1.9 | NaN | 0.0 | 0.9 | NaN | 2–0 | 2022 | 1.0 | 0.0 | Europe | Africa | France | Europe |
| 62 | Croatia | Morocco | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2022 | 0.0 | 0.0 | Europe | Africa | Morocco | Africa |
| 63 | Argentina | France | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2022 | 0.0 | 0.0 | South America | Europe | France | Europe |
64 rows × 16 columns
Now that all the cleansing has been done to both datasets, it's time to combine them. This is done easily with a function from pandas called concat. Taking two dataframes as parameters, it returns one that has the data of both of them.
I also take this opportunity to add one more field: yellows. This field sums the number of yellow cards received by the home and away teams. It is useful for analyzing the number of yellow cards given in a single game.
# Combined dataset for the Fifa World Cups
fifa = pd.concat([fifa_2022[pd.notnull(fifa_2022['home_score'])], fifa_past]).reset_index(drop=True)
fifa['yellows'] = fifa['home_yellows'] + fifa['away_yellows']
fifa
| home_team | away_team | home_score | home_xg | home_penalty | away_score | away_xg | away_penalty | Score | Year | away_yellows | home_yellows | home_continent | away_continent | winner | winner_cont | yellows | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Qatar | Ecuador | 0.0 | 0.3 | NaN | 2.0 | 1.2 | NaN | 0–2 | 2022 | 2.0 | 4.0 | Asia | South America | Ecuador | South America | 6.0 |
| 1 | England | IR Iran | 6.0 | 2.1 | NaN | 2.0 | 1.4 | NaN | 6–2 | 2022 | 2.0 | 0.0 | Europe | Asia | England | Europe | 2.0 |
| 2 | United States | Wales | 1.0 | 0.8 | NaN | 1.0 | 1.5 | NaN | 1–1 | 2022 | 2.0 | 4.0 | North America | Europe | DRAW | DRAW | 6.0 |
| 3 | Senegal | Netherlands | 0.0 | 0.9 | NaN | 2.0 | 0.7 | NaN | 0–2 | 2022 | 1.0 | 2.0 | Africa | Europe | Netherlands | Europe | 3.0 |
| 4 | France | Australia | 4.0 | 4.0 | NaN | 1.0 | 0.5 | NaN | 4–1 | 2022 | 3.0 | 0.0 | Europe | Oceania | France | Europe | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 957 | Argentina | France | 1.0 | NaN | NaN | 0.0 | NaN | NaN | 1–0 | 1930 | 0.0 | 0.0 | South America | Europe | Argentina | South America | 0.0 |
| 958 | Yugoslavia | Brazil | 2.0 | NaN | NaN | 1.0 | NaN | NaN | 2–1 | 1930 | 0.0 | 0.0 | Europe | South America | Yugoslavia | Europe | 0.0 |
| 959 | Romania | Peru | 3.0 | NaN | NaN | 1.0 | NaN | NaN | 3–1 | 1930 | 0.0 | 0.0 | Europe | South America | Romania | Europe | 0.0 |
| 960 | United States | Belgium | 3.0 | NaN | NaN | 0.0 | NaN | NaN | 3–0 | 1930 | 0.0 | 0.0 | North America | Europe | United States | North America | 0.0 |
| 961 | France | Mexico | 4.0 | NaN | NaN | 1.0 | NaN | NaN | 4–1 | 1930 | 0.0 | 0.0 | Europe | North America | France | Europe | 0.0 |
962 rows × 17 columns
All the cleaning is over and done with. Now, we can start with data analysis. The first thing I would like to do is example the summary statistics of a few fields. Summary statistics are useful for getting a glance at the distribution of the data without creating any plots first.
Namely, we will look at summary stats for home_score, away_score, and yellows. This process is made simple by using the numpy provided functions to calculate mean, median, minimum, maximum, and standard deviation.
To learn more about summary statistics and their uses, check out this link on Summary statistics.
# Summary Stats for Home Score
mean = np.mean(fifa['home_score'])
median = np.median(fifa['home_score'])
mini = np.min(fifa['home_score'])
maxi = np.max(fifa['home_score'])
stddev = np.std(fifa['home_score'])
print('Summary Stats for HOME SCORE')
print('----------------------------------')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
Summary Stats for HOME SCORE ---------------------------------- Mean: 1.7754677754677755 Std Dev: 1.597002467430215 Median: 1.0 Min: 0.0 Max: 10.0
# Summary Stats for Away Score
mean = np.mean(fifa['away_score'])
median = np.median(fifa['away_score'])
mini = np.min(fifa['away_score'])
maxi = np.max(fifa['away_score'])
stddev = np.std(fifa['away_score'])
print('Summary Stats for AWAY SCORE')
print('----------------------------------')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
Summary Stats for AWAY SCORE ---------------------------------- Mean: 1.0426195426195426 Std Dev: 1.0699314095569636 Median: 1.0 Min: 0.0 Max: 7.0
# Summary Stats for Yellows Combined
mean = np.mean(fifa['away_yellows'] + fifa['home_yellows'])
median = np.median(fifa['away_yellows'] + fifa['home_yellows'])
mini = np.min(fifa['away_yellows'] + fifa['home_yellows'])
maxi = np.max(fifa['away_yellows'] + fifa['home_yellows'])
stddev = np.std(fifa['away_yellows'] + fifa['home_yellows'])
print('Summary Stats for YELLOWS COMBINED')
print('----------------------------------')
print('Mean: ', mean, '\nStd Dev: ', stddev, '\nMedian: ', median, '\nMin: ', mini, '\nMax: ', maxi)
Summary Stats for YELLOWS COMBINED ---------------------------------- Mean: 2.6715176715176714 Std Dev: 2.347688646183192 Median: 2.0 Min: 0.0 Max: 17.0
For this next section, we will be visualising the data using plots, courtesy of the matplotlib library. Each plot will seek to answer a question involving one or more fields in the dataset. Our goal here is to see if we notice any trends.
This first question seeks to know whether or not scores have changed over time. Preliminarily, I will use a scatter plot to display this data. The x axis will be the year of the game, and the y axis will have the number of points scored in the game. This information will be split into two plots, one for the home team, and one for the away team. Having two graphs side by side is something that can be done using the very helpful function subplots.
Using subplots is more advantageous than using subplot since we can standardize the length of the y axis across both plots. Normally, a plot will have a y axis that is approximately the range of the data. Since the range of points scored in this case is different between teams, it would be misleading to see the two graphs with different y ranges. Using subplots fixes that.
# Graphing year by points scored, using helpful function subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
fig.suptitle('Scatter Plot of Home Scores/Away Scores over Time')
ax1.scatter(fifa['Year'], fifa['home_score'])
ax2.scatter(fifa['Year'], fifa['away_score'])
# Labelling each plot
ax1.set_xlabel("Year")
ax1.set_ylabel("Points Scored")
ax1.set_title("Home")
ax2.set_xlabel("Year")
ax2.set_ylabel("Points Scored")
ax2.set_title("Away")
plt.show()
It looks like a scatter plot is not the best plot to use for the data. As you can see, the values of home_score and away_score are always whole numbers. It's also impossible to tell which amount of points scored has the highest concentration. To improve this idea, we can use another plot: the Violin Plot
The above idea had prospect, but a scatter plot is not the best plot to use here. Instead, a violin plot works similarly, yet gives more information. It displays the distribution of the variable in the y-axis for each value of the variable in the x-axis. A violin plot also provides the mean value for the distribution as a horizontal line through the 'violin'.
# For each year, we need to collect the scores of that year into an array (one for home and one for away teams)
years = sorted(list(fifa['Year'].unique()))
home_scores = []
away_scores = []
# For each year, append a list of the scores for each team
for y in years:
home_scores.append( list(fifa[fifa['Year'] == y]['home_score']) )
away_scores.append( list(fifa[fifa['Year'] == y]['away_score']) )
# Using subplots to create two plots, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6), sharey=True)
fig.suptitle('Distribution of Home Scores/Away Scores over Time')
ax1.violinplot(home_scores, years, widths=4,showmeans=True)
ax2.violinplot(away_scores, years, widths=4,showmeans=True)
# Labelling
ax1.set_xlabel("Year")
ax1.set_ylabel("Points Scored")
ax1.set_title("Home")
ax2.set_xlabel("Year")
ax2.set_ylabel("Points Scored")
ax2.set_title("Away")
plt.show()
With a violin plot, it is much clearer how the data is distributed each year. It seems that there is a different between the distributions for teams that are listed as home or away. Perhaps it's the case that even being labelled as an away team has a negative effect on performance? It also seems that home teams have scored less and less over time.
You might also notice that there is a gap in data between 1940 and 1950. You may not have thought about it before now, but those years line up with the timeline of World War II. It then makes a lot of sense that we weren't holding a world tournament at a time like that!
For this question, a violin plot is also very effective. We want to see the distribution of yellow cards received each year, since their invention. Yellow (and red) cards started being used in the 1970 World Cup, although a similar mechanism of eliminating players was used in the 1950 tournament, for example. See this wikipedia article for more information.
For this question, it might be more effective to examine total yellow cards given in a game, rather than per home or away team.
# List of yellows
yellows = []
# For each year...
for y in years:
curr = fifa[fifa['Year'] == y]
yellows.append( list(curr['yellows']) ) # Append a list of the number of yellow cards given that year
fig, ax = plt.subplots(figsize=(15, 7))
ax.violinplot(yellows, years, widths=4,showmeans=True) # Creating the violin plot
# Labelling
ax.set_xlabel("Year")
ax.set_ylabel("Yellow Cards Given")
ax.set_title("Distribution of Yellow Cards over Time")
plt.show()
It is quite apparent that the maximum number of yellow cards given in games, since the conception of the idea, has drastically increased over time. From a minimum of 0 to a maximum of 17 in a single game! That number belongs to the heated game between the Netherlands and Argentina.
It is also true that the mean number of yellow cards given has increased over time (according to the horizontal line of each violin), though not perfectly.
Using the winner_cont field, we can count the number of wins for each continent over the entire history of the World Cup. For this, we will be removing every game that was a DRAW.
I prefer using a horizontal bar graph here because of the conservation of space. Since the counts are quite large, it's a better idea to have them extend sideways than upwards.
# Winners are counted per continent, and converted to a dictionary
winners = fifa.groupby(by='winner_cont')['winner'].agg('count').to_dict()
winners.pop('DRAW') # removing games that ended with a DRAW
# Keys: number of wins
# Values: the continents themselves
x_axis = list(winners.keys())
y_axis = list(winners.values())
clrs = ['#59bfff', '#bfe6ff', '#0da2ff', '#8cd3ff', '#d9f1ff', '#26abff'] # a range of blues
# Plotting with a horizonal bar graph
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(x_axis, y_axis, color=clrs)
# Labelling
ax.set_xlabel("Games Won")
ax.set_ylabel("Continent")
ax.set_title("Bar Graph of Wins per Continent")
plt.show()
Now that the semi finals are underway, the top 4 teams of this years World Cup are: Morocco and Croatia (competing for third place), France, and Argentina. I will create a line graph to depict the number of games these teams won in the past years. This is to get a sense for whether it may have been unexpected for these teams to make it this far, or if it was likely based on past games.
# Leading 4 teams of the 2022 World Cup
leads = ['Morocco', 'Argentina', 'Croatia', 'France']
plt.figure(figsize=(15, 7))
# For each team
for l in leads:
# A dataframe of the wins for a specific team
curr = fifa_past[fifa_past['winner'] == l].copy()
# Grouping by year to count games won
per_year = curr.groupby(by='Year')['winner_cont'].agg('count')
plt.plot(per_year, label = l) # Plotting, with a label for each line
plt.legend()
# Labelling
plt.title('Wins by the 2022 Top 4 Teams from 1930-2018')
plt.xlabel('Year')
plt.ylabel('Games Won')
plt.ylim([0, 8])
plt.show()
Since I am not analyzing the draws of any team, these numbers imply that the teams made it past the eliminaries and into the final 16 bracket.
It seems that Argentina and France have done well in the World Cup since very early on, and continue to do well. Croatia is also a team that does well in the World Cup historically (the country was formed in 1991, so the track record is short). Meanwhile, Morocco is a country that has not done as well throughout the tournaments, only winning one game in the round of 16. Despite this, it has managed to secure a spot as top 4 teams this year.
# CONTINENT SHIET
# Which continent scores most
# ax = fifa.boxplot(column=['home_score', 'away_score'], by='home_continent', figsize=(16, 7), grid=False)
# plt.suptitle('Box Plot of Home Scores/Away Scores by Continent')
# plt.show()
This final segment will use pie charts and focus on the trend between continents and points earned, as well as with yellow cards given.
We start with continents v.s. yellow card distribution. This is done by grouping wins per continent and summing them up. For each team, whether they were home or away, the number of yellow cards is counted up. In honor of the fact that we are plotting information about yellow cards, the pie chart is colored with shades of yellow.
# Set of continent names
labels = tuple(list(fifa['home_continent'].unique()))
sizes = [] # to be the sum of yellow cards earned
# For each continent...
for l in labels:
# Add the number of cards earned by a team when they were 'home' + the cards earned by that team when they were 'away'
sizes.append( sum(fifa[fifa['home_continent'] == l]['home_yellows'] )
+ sum(fifa[fifa['away_continent'] == l]['away_yellows'] ) )
# Plotting, labelling
fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Yellow Cards Given by Continent')
clrs = ['#e6cc00', '#e47200', '#e6cc00', '#e6b400', '#e69b00', '#e8e337'] # a range of yellow values
# Creating the pie chart
ax.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
# Score by continent
scores = []
for l in labels:
scores.append( sum(fifa[fifa['home_continent'] == l]['home_score'] )
+ sum(fifa[fifa['away_continent'] == l]['away_score'] ) )
fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Points Scored by Continent')
clrs = ['#bfe6ff', '#0da2ff', '#8cd3ff', '#59bfff', '#26abff', '#d9f1ff'] # a range of blue values
ax.pie(scores, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
# how many teams per country?
home_teams = fifa.groupby('home_continent')['home_team'].agg('count')
home_hash = home_teams.to_dict()
away_teams = fifa.groupby('away_continent')['away_team'].agg('count')
away_hash = away_teams.to_dict()
# Yellows by Continent with respect to number of teams
# Pie chart:
labels = tuple(list(fifa['home_continent'].unique()))
sizes = []
for l in labels:
sizes.append( ( sum(fifa[fifa['home_continent'] == l]['home_yellows']) / home_hash[l] )
+ ( sum(fifa[fifa['away_continent'] == l]['away_yellows']) / away_hash[l] ) )
fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Yellow Cards Given by Continent per Team on Average')
clrs = ['#e6b400', '#e8e337', '#e6cc00', '#e47200','#e5de00', '#e69b00'] # a range of yellow values
ax.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
# Score by continent with respect to number of teams
scores = []
for l in labels:
scores.append( ( sum(fifa[fifa['home_continent'] == l]['home_score']) / home_hash[l] )
+ ( sum(fifa[fifa['away_continent'] == l]['away_score']) / away_hash[l] ) )
fig, ax = plt.subplots(figsize=(6, 6))
fig.suptitle('Composition of Points Scored by Continent per Team on Average')
clrs = ['#bfe6ff', '#26abff', '#8cd3ff', '#59bfff', '#0da2ff', '#d9f1ff'] # a range of blue values
ax.pie(scores, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90, colors=clrs)
ax.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
fifa_1950 = fifa[fifa['Year'] >= 1950]
fifa_1950
# yellows depend on year? answer YES see p val < 0.05
summ = smf.ols(formula = 'yellows ~ Year', data=fifa_1950).fit()
print(summ.summary())
X = np.array(fifa_1950['Year']).reshape((-1, 1))
y = fifa_1950['yellows']
# Fitting a model using X and y.
reg = LinearRegression().fit(X, y)
pred = reg.predict(X)
# # Extracting the coefficient and intercept to give the formula (y = mx + b) of the model.
intercept = reg.intercept_
coef = reg.coef_[0]
# The formula for the model is printed below.
print('Model: y =', coef, 'x +', intercept)
years = list(fifa_1950['Year'].unique())
plt.figure(figsize=(13, 9))
plt.scatter(fifa_1950['Year'], fifa_1950['yellows'], s=10)
plt.plot(X, pred, color='r')
plt.title('Yellow Cards Earned v.s. Year')
plt.xlabel('Year')
plt.ylabel('Yellow Cards Given')
plt.show()
# does my score depend on my continent
home_home = smf.ols(formula = 'home_score ~ home_continent', data=fifa).fit()
print(home_home.summary())
# does my score depend on opponents continent
home_away = smf.ols(formula = 'home_score ~ away_continent', data=fifa).fit()
print(home_away.summary())
# does my xg depend on my opponents continent?
home_xg = smf.ols(formula = 'home_xg ~ away_continent', data=fifa).fit()
print(home_xg.summary())